Communicate Data Findings

by Ayush Gupta

Introduction

Ford GoBike is the Bay Area's bike share system. Bay Area Bike Share was introduced in 2013 as a pilot program for the region, with 700 bikes and 70 stations across San Francisco and San Jose.

Ford GoBike, like other bike share systems, consists of a fleet of specially designed, sturdy and durable bikes that are locked into a network of docking stations throughout the city. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips. People use bike share to commute to work or school, run errands, get to appointments or social engagements and more. It's a fun, convenient and affordable way to get around.

The bikes are available for use 24 hours/day, 7 days/week, 365 days/year and riders have access to all bikes in the network when they become a member or purchase a pass.

Dataset Overview

  1. Name: dataset.csv

  2. Source: https://www.lyft.com/bikes/bay-wheels/system-data

    The original link provided in the project (https://www.fordgobike.com/system-data) redirects to the link above.

  3. File versions: 01/2018 - 04/2019

    Data files for the remaining months of 2019 are also available, but they are not used here: they differ in file naming and contain additional fields, so combining them with the earlier files would require substantial modification.
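
One way to check whether later files could be combined with earlier ones is to compare their columns before concatenating. The sketch below uses two tiny in-memory stand-ins for monthly files; the rows are made up, and `rental_access_method` is only assumed here as an example of an added field. With `join="inner"`, `pd.concat` keeps only the columns that every file shares.

```python
import io
import pandas as pd

# Two tiny stand-ins for monthly trip files; the rows are made up.
older_csv = io.StringIO(
    "duration_sec,start_time,user_type\n"
    "598,2018-01-31 22:52:35,Subscriber\n"
)
newer_csv = io.StringIO(
    "duration_sec,start_time,user_type,rental_access_method\n"
    "412,2019-07-01 08:03:10,Customer,app\n"
)

frames = [pd.read_csv(older_csv), pd.read_csv(newer_csv)]

# Columns that appear in some files but not in all of them.
shared = set.intersection(*(set(f.columns) for f in frames))
extra = set.union(*(set(f.columns) for f in frames)) - shared

# join="inner" keeps only the shared columns, so no mostly-empty fields appear.
combined = pd.concat(frames, join="inner", ignore_index=True)
print(sorted(extra))
print(combined.shape)
```

Listing the `extra` columns first makes it easy to decide whether dropping them is acceptable or whether the files need individual handling.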

In [5]:
# import all packages and set plots to be embedded inline
import warnings
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# KMeans is used below to group the stations into three areas
from sklearn.cluster import KMeans
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set_palette("viridis")
# suppress warnings from final output
warnings.simplefilter("ignore")
In [6]:
df = pd.read_csv("./data/processed/dataset_clean.csv")
df_station_names = pd.read_csv("./data/processed/df_station_names.csv")

for col in ["start_time", "end_time"]:
    df[col] = pd.to_datetime(df[col])

for col in ["start_station_id", "end_station_id"]:
    df[col] = df[col].astype('int')

for col in ["start_station_id", "end_station_id", "bike_id"]:
    df[col] = df[col].astype('str')

for col in ["user_type", "bike_share_for_all_trip"]:
    df[col] = df[col].astype('category')

kmeans = KMeans(n_clusters=3).fit(
    df_station_names[["station_longitude", "station_latitude"]])

df_station_names["label"] = kmeans.labels_

mapping = {0: "San Francisco", 1: "San José", 2: "East Bay"}

df_station_names["label_name"] = df_station_names["label"].map(mapping)
df_station_names.drop_duplicates(subset=["new_id"], inplace=True)

df = df.merge(df_station_names[["new_id", "label"]],
              left_on="start_station_id_new", right_on="new_id", how="outer")

df["label_name"] = df["label"].map(mapping)

df['month_year'] = df["start_time"].dt.to_period('M')

df['day_month_year'] = df["start_time"].dt.to_period('D')

# Monday = 0, ..., Sunday = 6
df["dayofweek"] = df["start_time"].dt.dayofweek

df["start_hr"] = df["start_time"].dt.hour
df["end_hr"] = df["end_time"].dt.hour
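
KMeans assigns its cluster numbers in no guaranteed order, so a fixed mapping such as `{0: "San Francisco", ...}` can attach the wrong name on a different run. A minimal sketch of one alternative is to derive the names from the cluster centroids, assuming the centroids are given as (longitude, latitude) pairs. The helper `name_clusters` and the example centroid values are illustrative and not part of the original notebook.

```python
import numpy as np


def name_clusters(centers):
    """Map each cluster label to an area name using (longitude, latitude) centroids."""
    centers = np.asarray(centers, dtype=float)
    labels = list(range(len(centers)))
    # San José is the southernmost of the three areas (smallest latitude).
    san_jose = min(labels, key=lambda k: centers[k, 1])
    rest = [k for k in labels if k != san_jose]
    # Of the two northern clusters, San Francisco lies further west
    # (smaller, i.e. more negative, longitude) than the East Bay.
    san_francisco = min(rest, key=lambda k: centers[k, 0])
    east_bay = next(k for k in rest if k != san_francisco)
    return {san_francisco: "San Francisco",
            san_jose: "San José",
            east_bay: "East Bay"}


# Approximate centroids in an arbitrary label order (illustrative values).
example_centers = [(-121.89, 37.33), (-122.27, 37.83), (-122.41, 37.77)]
area_names = name_clusters(example_centers)
print(area_names)
```

In the notebook this could be called as `name_clusters(kmeans.cluster_centers_)`, since the model is fitted on the longitude and latitude columns in that order.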

Source: kepler.gl

Let's take a closer look at the three areas: San Francisco, East Bay and San José.

All Stations

San Francisco and East Bay

Upper Cluster

San José

Lower Cluster

Who are the people that are using this service?

Let's find out. First, we will look at the distribution of trip durations in each area.

In [7]:
fig, ax = plt.subplots(figsize=(12, 5), dpi=110)
palette = sns.color_palette("viridis")
for i, x in enumerate(["San Francisco", "East Bay", "San José"]):
    df_new = df.query(f"label_name == '{x}'")

    # 100-second bins covering the full range of durations in this area
    bin_size = 100
    bins = np.arange(0, df_new.duration_sec.max()+bin_size, bin_size)

    # palette entries 1, 3 and 5 keep the three areas visually distinct
    plt.hist(df_new.duration_sec, bins=bins, label=x,
             color=palette[2*i+1], edgecolor="black", lw=0.4)

plt.xticks(ticks=[x for x in range(0, 7000, 250)])
plt.legend()
plt.xlim(-100, 3500)
plt.title("Frequency of trip durations per area in seconds")
plt.xlabel("Seconds")
plt.tight_layout()
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)
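
A numeric summary can complement the histogram. The sketch below computes the median and mean duration per area on made-up trips; only the column names `label_name` and `duration_sec` are taken from the notebook. Because trip durations have a long right tail, the mean tends to sit above the median.

```python
import pandas as pd

# Made-up trips; only the column names match the notebook's data frame.
trips = pd.DataFrame({
    "label_name": ["San Francisco"] * 3 + ["East Bay"] * 3 + ["San José"] * 2,
    "duration_sec": [400, 600, 5000, 300, 500, 700, 350, 450],
})

# Median and mean per area; a single very long trip pulls the mean upwards.
summary = trips.groupby("label_name")["duration_sec"].agg(["median", "mean"])
print(summary)
```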

User Type

There are far more Subscribers than Customers. This suggests that many people use the service regularly, for example to commute to work or school.
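
The share of each user type can be read directly from the column with `value_counts(normalize=True)`. This is a small sketch on made-up values, not the notebook's actual counts.

```python
import pandas as pd

# Made-up user types standing in for the notebook's user_type column.
user_type = pd.Series(["Subscriber"] * 6 + ["Customer"] * 2, name="user_type")

# normalize=True returns fractions; multiply by 100 for percentages.
shares = user_type.value_counts(normalize=True).mul(100).round(1)
print(shares)
```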

In [8]:
value_ct = df.user_type.value_counts()

fig, ax = plt.subplots(figsize=(12, 5), dpi=110)
sns.countplot(x="user_type", data=df, order=value_ct.index,
              lw=0.5, edgecolor="black")

cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)

# label each bar with its share of all trips that have a user type
total = value_ct.sum()
for p in ax.patches:
    ax.annotate('{:10.0f}%'.format(p.get_height()/total * 100),
                (p.get_x()+0.31, p.get_height()+40000))

plt.title("Users By Type")
plt.xlabel("");

The next plots will focus on time components of our data.

What about trips per day?

It looks like riders use the bikes more often on weekdays than at the weekend.

In [9]:
fig, ax = plt.subplots(figsize=(12, 4), dpi=110)
sns.countplot(x="dayofweek", data=df, lw=0.5, edgecolor="black")

plt.tight_layout()
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)
plt.title("Relative frequency of trips per day")
ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
plt.xlabel("")
plt.ylim(0, 500000)
for p in ax.patches:
    ax.annotate('{:10.0f}%'.format(p.get_height()/len(df)*100),
                (p.get_x()+0.1, p.get_height()+20000))
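
Because there are five weekdays but only two weekend days, comparing trips per day is fairer than comparing totals. The sketch below does this on a handful of made-up start times; the `day_type` column is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up start times: Monday to Friday plus one Saturday.
trips = pd.DataFrame({"start_time": pd.to_datetime([
    "2019-04-01 08:05", "2019-04-02 17:10", "2019-04-03 08:40",
    "2019-04-04 12:00", "2019-04-05 18:20", "2019-04-06 11:30",
])})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
trips["day_type"] = np.where(trips["start_time"].dt.dayofweek >= 5,
                             "weekend", "weekday")

# Divide the totals by the number of days of each type in a week.
counts = trips["day_type"].value_counts()
per_day = counts / pd.Series({"weekday": 5, "weekend": 2})
print(per_day)
```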

Trip Start Time

The most frequent starting hours are 08:00 and 17:00. People probably use the bikes before and after work, which would make sense because our dataset contains many subscribers of working age. You only subscribe to something you want to use regularly, so a daily commute to work or school is a natural fit.
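
The peak hours can also be extracted numerically. This is a minimal sketch on made-up start times that computes the share of trips per starting hour and picks the two largest.

```python
import pandas as pd

# Made-up start times clustered around the morning and evening commute.
start_time = pd.Series(pd.to_datetime([
    "2019-04-01 08:05", "2019-04-01 08:40", "2019-04-01 08:55",
    "2019-04-01 17:10", "2019-04-01 17:45", "2019-04-01 12:00",
]))

# Share of trips per starting hour, in percent, ordered by hour.
start_hr = start_time.dt.hour
hourly_share = start_hr.value_counts(normalize=True).sort_index() * 100

# The two hours with the largest share of trips.
peak_hours = hourly_share.nlargest(2).index.tolist()
print(peak_hours)
```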

In [10]:
fig, ax = plt.subplots(figsize=(11, 5), dpi=110)

sns.countplot(x="start_hr", data=df, ax=ax, lw=0.5, edgecolor="black")

plt.tight_layout()
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)
plt.title("Relative frequency of trips per starting hour")
plt.xlabel("Starting hour")
plt.ylim(0, 400000)
for p in ax.patches:
    ax.annotate('{:10.1f}%'.format(p.get_height()/len(df)*100),
                (p.get_x()-0.8, p.get_height()+15000))

ax.text(0-1.15, ax.patches[0].get_height()+13000,
        '{:10.1f}%'.format(ax.patches[0].get_height()/len(df)*100))

Does the average trip duration depend on the weekday?

Bikes are used less often at the weekend, but the average trip lasts longer than during the week!

In [11]:
# creating the legend object for the next plot
legend_obj = []

# six viridis colors plus one extra green for the seventh day
colors = list(sns.color_palette("viridis", 6)) + [(163/255, 199/255, 70/255)]

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# one empty scatter handle per day, usable as a legend entry
for color in colors:
    legend_obj.append(plt.scatter([], [], color=color))
In [12]:
fig, ax = plt.subplots(figsize=(12, 7), dpi=110)
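# average duration per weekday and month; selecting the column first
# avoids averaging the non-numeric columns
sns.boxplot(x="dayofweek", y="duration_sec", data=df.groupby(
    ["dayofweek", "month_year"], as_index=False)["duration_sec"].mean())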

plt.tight_layout()
sns.despine(fig, left=True)
plt.xlabel("")
plt.ylabel("Duration in seconds")
plt.title("Average trip duration per day")

ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]);
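
Note that the boxplot above is drawn from monthly averages rather than individual trips: each box summarises one mean value per weekday and month. The sketch below shows this intermediate table on made-up data, selecting `duration_sec` before averaging so that non-numeric columns such as `bike_id` are left out.

```python
import pandas as pd

# Made-up trips for two weekdays (0 = Monday, 5 = Saturday) in two months.
trips = pd.DataFrame({
    "dayofweek": [0, 0, 5, 5, 0, 0, 5, 5],
    "month_year": ["2019-01"] * 4 + ["2019-02"] * 4,
    "duration_sec": [500, 700, 1200, 1000, 400, 600, 900, 1300],
    "bike_id": list("abcdefgh"),  # non-numeric column, not averaged
})

# One mean duration per (weekday, month) pair; these rows feed the boxplot.
monthly_mean = trips.groupby(
    ["dayofweek", "month_year"], as_index=False)["duration_sec"].mean()
print(monthly_mean)
```

With 16 months of data, each box in the notebook's plot would therefore be built from about 16 values.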

This trend holds in all areas. We can also see that San Francisco users have, on average, the longest trips, followed by East Bay and then San José.

In [13]:
fig, ax = plt.subplots(figsize=(12, 7), dpi=110)
# average duration per weekday, month and area (numeric column selected first)
sns.boxplot(x="dayofweek", y="duration_sec", data=df.groupby(
    ["dayofweek", "month_year", "label_name"], as_index=False)["duration_sec"].mean(), hue="label_name")
plt.tight_layout()
sns.despine(fig, left=True)
plt.xlabel("")
plt.ylabel("Duration in seconds")
plt.title("Average trip duration per day per area")

ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])

Average Starting Hour per Day

In [14]:
fig, ax = plt.subplots(figsize=(12, 7), dpi=110)
# average starting hour per weekday and month (numeric column selected first)
sns.boxplot(x="dayofweek", y="start_hr", data=df.groupby(
    ["dayofweek", "month_year"], as_index=False)["start_hr"].mean())

plt.tight_layout()
sns.despine(fig, left=True)
plt.xlabel("")
plt.ylabel("Starting hour")
plt.title("Average starting hour per day")

ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]);

Looking at each area is interesting: users in East Bay and San José not only take shorter trips on average, they also start their trips later than users in San Francisco.

In [15]:
fig, ax = plt.subplots(figsize=(12, 7), dpi=110)
# average starting hour per weekday, month and area (numeric column selected first)
sns.boxplot(x="dayofweek", y="start_hr", data=df.groupby(
    ["dayofweek", "month_year", "label_name"], as_index=False)["start_hr"].mean(), hue="label_name")

plt.tight_layout()
sns.despine(fig, left=True)
plt.xlabel("")
plt.ylabel("Starting hour")
plt.title("Average starting hour per day per area")

ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]);

For the final visualizations, let's look at the trips themselves.

First, we will look at San Francisco.

San Francisco Trips

We can see that most of the trips are close to the beach.

Now for East Bay

East Bay Trips

Here the main routes are much more spread out than in San Francisco. It also looks like people use the service to cover short distances quickly.

San José

San José Trips

In San José, the trips are spread over most of the stations.

Summary

A large number of people can benefit from this program:

  • Environmentally friendly, budget friendly, and lifestyle friendly.
  • Subscribers (i.e. daily commuters) benefit from a healthy commuting choice.
  • Customers (i.e. tourists, students, etc.) have a sustainable, yet flexible option for touring the city.
  • Affordable and convenient transportation for people of all socioeconomic classes.
  • Renting a bike from the Ford GoBike System is a fantastic (healthy and environmentally friendly) way of moving around the city, both for enjoyment and for work.

There are two types of clients using the system: Subscribers and Customers. Subscribers are primarily daily commuters who take short trips to and from work and rent bikes on weekdays at 8-9am and 5-6pm. Customers are usually tourists or occasional riders who use the system mainly on weekends to explore the Bay Area. Usage has been increasing in 2019, and it spikes during the summer and into the autumn.
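
The contrast between the two user types could be checked with a cross-tabulation of user type against weekday or weekend. The sketch below uses made-up trips, so its numbers say nothing about the real data; the `day_type` column is introduced here for illustration.

```python
import pandas as pd

# Made-up trips; dayofweek is 0 for Monday through 6 for Sunday.
trips = pd.DataFrame({
    "user_type": ["Subscriber"] * 4 + ["Customer"] * 4,
    "dayofweek": [0, 1, 2, 5, 5, 6, 6, 2],
})
trips["day_type"] = trips["dayofweek"].map(
    lambda d: "weekend" if d >= 5 else "weekday")

# normalize="index" turns each row into shares that sum to 1 per user type.
share = pd.crosstab(trips["user_type"], trips["day_type"],
                    normalize="index") * 100
print(share)
```

The same table split by `start_hr` instead of `day_type` would show whether the 8-9am and 5-6pm peaks come mainly from Subscribers.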